Advanced Statistical Analysis of Fitness Data for Predictive Insights & Personalization

Research Project (SMST604)

Purva Amit Puranik (3032411004)

DES Pune University — Dept. of Statistics

Omkar Nilesh Ninav (3532411003)

DES Pune University — Dept. of Statistics

2025-10-14

Motivation Behind the Project

  • Fitness data reflecting exercise patterns, body composition, and lifestyle factors offer valuable opportunities for quantitative analysis.
  • Understanding how these factors interact can help reveal key determinants of progress and explain variations in individual performance.
  • The project focuses on building statistical and predictive models to detect performance plateaus and forecast outcomes for fitness improvement.
  • Metrics such as the Resilience Index and Progress Ratio aim to make analytics more interpretable and actionable for individual users.

Proposed Research Questions

  1. To what extent does consistent workout completion (e.g., meeting weekly or monthly targets) lead to a significant improvement in body fat percentage?
  2. Can a Bayesian Structural Time Series (BSTS) model accurately forecast stagnation in key physiological metrics (e.g., body fat %, BMI) at least one week in advance?
  3. What distinct client clusters emerge from an unsupervised analysis of longitudinal activity and progress data (e.g., Fast Responders, At-Risk Plateauers)?
  4. How can the Resilience Index and Progress Ratio be quantitatively defined and operationalized, and do these metrics correlate with long-term user success and engagement?

Technical Details and Domain Knowledge

Interdisciplinary Scope

  • Combines statistics, exercise physiology, and wearable technology.
  • Uses synthetically generated data designed to reflect real-world exercise and lifestyle patterns.
  • Bridges the gap between data science and human performance understanding.

Personalization and Adaptation

  • Applies clustering to identify groups with similar fitness responses.
  • Uses adaptive models that update as new data are generated.
  • Enables personalized progress tracking and interpretable metrics such as the Resilience Index and Progress Ratio.

Statistical Methodologies

General Analysis

  • Descriptive statistics to summarize fitness metrics (e.g., mean, variance, correlation).
  • Inferential tests to examine the impact of workout consistency on body fat %.
  • Regression modeling to quantify relationships between input features and outcomes.

Predictive Modeling

  • Bayesian Structural Time Series (BSTS) to forecast stagnation in metrics like body fat % or BMI.
  • Model validation through posterior predictive checks and forecast accuracy measures (e.g., MAE, RMSE).
  • Helps in early detection of plateaus and performance forecasting.
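
A common structural specification for this kind of forecast is the local linear trend model. The slides do not fix the model components, so the form below is an assumption, as are the symbols: y_t is the observed body fat % in week t, \mu_t the latent level, and \delta_t the latent weekly slope.

```latex
\begin{aligned}
y_t &= \mu_t + \varepsilon_t, & \varepsilon_t &\sim \mathcal{N}(0, \sigma_\varepsilon^2) \\
\mu_{t+1} &= \mu_t + \delta_t + \eta_t, & \eta_t &\sim \mathcal{N}(0, \sigma_\eta^2) \\
\delta_{t+1} &= \delta_t + \zeta_t, & \zeta_t &\sim \mathcal{N}(0, \sigma_\zeta^2)
\end{aligned}
```

Under this sketch, a plateau forecast h weeks ahead could be expressed as the posterior probability P(|\mu_{T+h} - \mu_T| < \tau \mid y_{1:T}), where the tolerance \tau is a hypothetical threshold that would need to be chosen on physiological grounds.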

Unsupervised Learning

  • Clustering algorithms (e.g., K-Means, Hierarchical, or DBSCAN) to segment users.
  • Identify behavioral groups such as Fast Responders or At-Risk Plateauers.
  • Use Principal Component Analysis (PCA) for dimensionality reduction and visualization.

Custom Metrics and Correlation

  • Define the Resilience Index (a measure of recovery and adaptability) and the Progress Ratio (rate of improvement).
  • Validate these metrics using correlation and regression analysis against long-term success indicators.
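
The slides do not give formulas for either metric, so the following is only an illustrative sketch of how they might be operationalized for a weekly body fat % series. Both definitions are assumptions: here the Progress Ratio is the share of the start-to-goal distance covered so far, and the Resilience Index is the fraction of setbacks (weeks where body fat % rose) that were recovered within a fixed window. The function names, the `goal` value, and the `window` parameter are hypothetical.

```python
def progress_ratio(series, goal):
    """Assumed definition: fraction of the start-to-goal distance covered so far."""
    start, current = series[0], series[-1]
    if start == goal:
        raise ValueError("start value already equals the goal")
    return (start - current) / (start - goal)


def resilience_index(series, window=2):
    """Assumed definition: share of setbacks recovered within `window` weeks.

    A setback is a week where the value rises (worse for body fat %).
    It counts as recovered if the series returns to or below the
    pre-setback value within the next `window` observations.
    """
    setbacks = 0
    recovered = 0
    for t in range(1, len(series)):
        if series[t] > series[t - 1]:
            setbacks += 1
            baseline = series[t - 1]
            following = series[t + 1 : t + 1 + window]
            if any(value <= baseline for value in following):
                recovered += 1
    # With no setbacks at all, treat the user as fully resilient.
    return recovered / setbacks if setbacks else 1.0


# Hypothetical weekly body fat % readings for one user.
body_fat = [25.0, 24.5, 24.8, 24.3, 24.0, 24.4, 24.5, 24.2]
pr = progress_ratio(body_fat, goal=22.0)
ri = resilience_index(body_fat, window=2)
print(f"Progress Ratio: {pr:.3f}, Resilience Index: {ri:.3f}")
```

Other definitions are equally plausible, for example a slope-based Progress Ratio. Whichever form is chosen, the validation step against long-term success indicators applies unchanged.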

Proposed Analytical Pipeline & Flow

  • The analytical workflow runs from data preprocessing and sampling to the analyses for the four core research questions (RQ1–RQ4).
  • These analyses cover causal inference (RQ1), time-series forecasting (RQ2), clustering (RQ3), and personalized metric modeling (RQ4).

RQ1 – Causal Effect of Workout Consistency

  • Estimate the effect of workout consistency on body fat % using Marginal Structural Models (MSM).
  • Apply Inverse Probability of Treatment Weighting (IPTW) for bias adjustment.
  • Validate robustness through sensitivity analysis and model diagnostics.
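
The weighting idea can be illustrated with a deliberately simplified sketch: a single time point, one binary confounder, and propensities estimated from stratum frequencies rather than a fitted model. A full MSM would use time-varying treatments and covariates, so this is not the project's estimator. The records, the confounder (labelled here as high baseline motivation), and the function names are all hypothetical.

```python
from collections import Counter


def stabilized_weights(records):
    """Stabilized IPT weights: P(A = a) / P(A = a | L = l) for each record.

    Each record is (confounder L, treatment A, outcome Y). Probabilities
    are estimated from simple frequency counts.
    """
    n = len(records)
    n_by_a = Counter(a for _, a, _ in records)
    n_by_l = Counter(l for l, _, _ in records)
    n_by_la = Counter((l, a) for l, a, _ in records)
    weights = []
    for l, a, _ in records:
        marginal = n_by_a[a] / n
        conditional = n_by_la[(l, a)] / n_by_l[l]
        weights.append(marginal / conditional)
    return weights


def weighted_mean_difference(records, weights):
    """Weighted mean outcome of treated minus weighted mean of untreated."""
    means = {}
    for arm in (0, 1):
        num = sum(w * y for (_, a, y), w in zip(records, weights) if a == arm)
        den = sum(w for (_, a, _), w in zip(records, weights) if a == arm)
        means[arm] = num / den
    return means[1] - means[0]


# Toy data: L = 1 means high baseline motivation, A = 1 means consistent
# workouts, Y is change in body fat %. Outcomes follow Y = -1.0*A - 0.5*L,
# so the effect of consistency built into the data is -1.0.
records = (
    [(1, 1, -1.5)] * 6 + [(1, 0, -0.5)] * 2
    + [(0, 1, -1.0)] * 2 + [(0, 0, 0.0)] * 6
)

naive = weighted_mean_difference(records, [1.0] * len(records))
iptw = weighted_mean_difference(records, stabilized_weights(records))
print(f"naive difference: {naive:.3f}, IPTW difference: {iptw:.3f}")
```

In this toy setup motivated users are both more consistent and lose more fat regardless of consistency, so the unweighted comparison overstates the effect, while the weighted comparison recovers the value built into the data.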

RQ2 – Plateau Forecasting

  • Fit Bayesian Structural Time Series (BSTS) models to forecast stagnation in body fat %.
  • Convert forecasts into posterior plateau probabilities and evaluate them using precision, recall, and calibration curves.
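
As a rough illustration of the evaluation step, the sketch below scores a set of plateau probabilities against observed plateau labels. The probabilities, labels, decision threshold, and bin count are made-up values, and the function names are hypothetical.

```python
def precision_recall(probs, labels, threshold=0.5):
    """Precision and recall when a plateau is predicted if prob >= threshold."""
    pairs = list(zip(probs, labels))
    tp = sum(1 for p, y in pairs if p >= threshold and y == 1)
    fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
    fn = sum(1 for p, y in pairs if p < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def calibration_table(probs, labels, n_bins=5):
    """Per-bin (mean predicted probability, observed frequency, count).

    Bins are equal-width on [0, 1]; empty bins are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            table.append((mean_pred, observed, len(members)))
    return table


# Hypothetical one-week-ahead plateau probabilities and what actually happened.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

precision, recall = precision_recall(probs, labels)
table = calibration_table(probs, labels, n_bins=2)
print(f"precision={precision:.2f}, recall={recall:.2f}")
for mean_pred, observed, count in table:
    print(f"mean predicted={mean_pred:.2f}, observed={observed:.2f}, n={count}")
```

A well-calibrated forecaster shows mean predicted probabilities close to the observed frequencies in every bin; plotting the two against each other gives the calibration curve.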

RQ3 – User Segmentation

  • Perform Principal Component Analysis (PCA) for dimensionality reduction.
  • Identify clusters using K-Means and Bayesian Gaussian Mixture Models (BGMM).
  • Validate cluster stability and interpret behavioral–physiological profiles.
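
The K-Means step can be sketched in a few lines of plain Python. This toy version skips PCA and the BGMM comparison, uses two made-up features per user (a normalized fat-loss rate and workout adherence), and takes fixed initial centroids so the result is reproducible. All names and values are hypothetical.

```python
import math


def kmeans(points, initial_centroids, n_iter=20):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in initial_centroids]

    def nearest(point):
        # Index of the centroid closest to `point` in Euclidean distance.
        return min(range(len(centroids)),
                   key=lambda j: math.dist(point, centroids[j]))

    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(point)].append(point)
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[j]  # keep an empty cluster's centroid
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stabilized
            break
        centroids = new_centroids

    labels = [nearest(point) for point in points]
    return centroids, labels


# Hypothetical users: (normalized fat-loss rate, workout adherence).
users = [
    (0.90, 0.95), (0.80, 0.90), (0.85, 0.85),  # look like fast responders
    (0.10, 0.50), (0.15, 0.40), (0.05, 0.45),  # look like at-risk plateauers
]
centroids, labels = kmeans(users, initial_centroids=[users[0], users[3]])
print(labels)
print(centroids)
```

In practice the initialization would be randomized and repeated, and cluster stability would be checked across resamples, as the validation bullet above indicates.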

RQ4 – Personalized Metrics

  • Compute Progress Ratio and Resilience Index from user trajectories.
  • Assess associations using OLS regression and Cox survival models.
  • Apply SHAP values for interpretability and insight into feature importance.
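
For the survival component, one way to write the Cox model is shown below. The notation is an assumption: T_i is user i's time to goal achievement, h_0(t) is an unspecified baseline hazard, and RI_i and PR_i are that user's Resilience Index and Progress Ratio.

```latex
h_i(t) = h_0(t)\,\exp\!\left(\beta_1\,\mathrm{RI}_i + \beta_2\,\mathrm{PR}_i\right)
```

With goal achievement as the event, \exp(\beta_1) > 1 would indicate that a higher Resilience Index is associated with reaching the goal sooner, holding the Progress Ratio fixed, and likewise for \exp(\beta_2).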

Details About the Data

| Category | Variables | Distribution | Statistical Justification |
|---|---|---|---|
| Physical & Biometric | Height, Weight, Chest, Waist, Neck, BMI | Truncated Normal | Reflects natural variation; truncation ensures realistic physiological limits; preserves proportional relationships among traits |
| Behavioral & Categorical | Exercise Schedule, Consistency, Gender | Bernoulli / Categorical | Discrete behavioral variables; probabilities reflect empirical proportions |
| Exercise Performance | Sets, Reps, Weight Lifted | Log-Normal / Discrete Normal | Captures skewed exercise output; discrete normal for symmetric count data like reps |
| Temporal Progress | Weekly Weight, Body Fat %, Caloric Change | AR(1) / Logistic Decay | Models temporal autocorrelation and diminishing returns in physiological adaptation |
| Lifestyle | Sleep Hours, Calories Burned | Uniform / Normal | Captures wide individual variability without strong skew |
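
Two of the generating mechanisms in the table can be sketched as follows: a truncated normal for a biometric variable, drawn by rejection sampling, and an AR(1) process for weekly body fat %. Every numeric parameter below (means, bounds, autoregressive coefficient, noise level) is a placeholder chosen for illustration, not a value from the project.

```python
import random


def truncated_normal(rng, mean, sd, low, high):
    """Draw from Normal(mean, sd) restricted to [low, high] by rejection sampling."""
    while True:
        x = rng.gauss(mean, sd)
        if low <= x <= high:
            return x


def ar1_series(rng, n_weeks, start, level, phi, noise_sd):
    """Simulate x_t = level + phi * (x_{t-1} - level) + e_t, e_t ~ Normal(0, noise_sd)."""
    series = [start]
    for _ in range(n_weeks - 1):
        previous = series[-1]
        series.append(level + phi * (previous - level) + rng.gauss(0.0, noise_sd))
    return series


rng = random.Random(42)  # fixed seed so the synthetic data are reproducible

# Heights in cm, truncated to a plausible adult range (placeholder bounds).
heights_cm = [truncated_normal(rng, 170.0, 10.0, 150.0, 200.0) for _ in range(100)]

# Twelve weeks of body fat %, drifting from 30 toward a long-run level of 24.
body_fat = ar1_series(rng, n_weeks=12, start=30.0, level=24.0, phi=0.8, noise_sd=0.3)

print(min(heights_cm), max(heights_cm))
print([round(value, 1) for value in body_fat])
```

Because each week closes only a fraction (1 - phi) of the remaining gap to the long-run level, this AR(1) form also produces the diminishing returns that the table attributes to physiological adaptation.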

Summary of Research Papers

Piwek et al. (2016) — The Rise of Consumer Health Wearables

  • Discusses the exponential growth of wearable health technology and its potential for personalized analytics.
  • Highlights barriers like data accuracy, non-adherence, and missingness, motivating rigorous statistical modeling of behavioral data.
  • Establishes the need for predictive, interpretable models beyond descriptive tracking.

Fitzmaurice et al. (2011) — Applied Longitudinal Analysis

  • Provides a rigorous framework for modeling repeated measures with correlated errors.
  • Emphasizes time-dependent confounding and the limitations of standard regression approaches.
  • Directly supports our Marginal Structural Models (MSM) and Inverse Probability of Treatment Weighting (IPTW) approach in RQ1.
  • Strengthens the causal interpretation of the estimated effect of workout consistency on body fat percentage, provided the weighting model's assumptions hold.

James et al. (2013) & Hastie et al. (2009) — Statistical Learning Foundations

  • Together, these texts define the statistical learning theory guiding model design and validation.
  • Provide a foundation for regression, classification, clustering, PCA, and regularization.
  • Support RQ3 (user segmentation) through dimensionality reduction (PCA) and Bayesian Gaussian Mixture Models (BGMM).
  • Emphasize the bias–variance tradeoff, cross-validation, and generalization performance in predictive modeling.

Klein & Moeschberger (2012) — Survival Analysis: Techniques for Censored and Truncated Data

  • Formalizes methods for handling censored and time-to-event data.
  • Supports modeling of time-to-goal achievement as a survival process.
  • Enables linking Resilience Index and Progress Ratio with probability of long-term success using Cox proportional hazards models.
  • Integrates traditional survival analysis with predictive modeling for personalized insight.

Thank You

Advanced Statistical Analysis of Fitness Data for Predictive Insights and Personalization

Presented by:

Purva Puranik
M.Sc. Statistics

Omkar Ninav
M.Sc. Statistics

Questions & Discussion

Integrating statistical theory with behavioral science for actionable fitness analytics.